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(57) Abstract: System, process and software arrangement produces high resolution, high accuracy, ordered, genome wide haplo- 
typed maps from single molecule based approximate ordered maps and the location of genes responsible for genetic diseases are 
determined by performing an association study using a population of genome wide hapk) typed maps. This can also be used with 
Optical Mapping data to assemble a genome wide haplotyped restriction map based on multiple distinguishable restriction enzymes. 
This invention can also be used with any other single molecule process that can produce approximate ordered physical map from 
randomly broken DNA pieces of a particular genome. 
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CROSS-REFERENCE TO RELATED APPLICATION 

The present application claims priority from United States Patent 
Application No. 60/427903, filed November 20, 2002, the entire disclosure of which 
incorporated herein by reference. 

FIELD OF THE INVENTION 



arrang 



The present invention relates to systems, process and software 
its for producing genome wide haplotyped maps. More particularly, the 



present invention relates to systems, process and software arrangements for producing 
genome wide haplotyped maps from single molecule based approximate ordered maps 
and locating genes responsible for genetic diseases. 

BACKGROUND OF THE INVENTION 

One of the goals of genomics is to locate genes responsible for genetic 
diseases. The traditional approaches to locating such genes are generally based on 
finding single polymorphic genetic markers that are co-inherited with the disease with 
such regularity that it can be assumed that the single disease-causing gene is located 
very close to the marker. These approaches are traditionally divided into two classes, 
Linkage Analysis, as described in Neil J. Risch, "Searching for Genetic Determinants 
in the New Millennium" Nature, 405, June 2000, the disclosure of which is 
incorporated herein by reference, and (single marker) Association Studies as 
described in Thomas G. Schulze and Francis J McMabon, "Genetic Association 
Mapping at the Crossroads: Which Test and Why? Overview and Practical 
Guidelines" American Journal of Medical Genetics (Neuropsychiatric Genetics) 
volume 1 14, pages 1-11 (2002)., the disclosure of which is incorporated herein by 
reference. Both these conventional approaches typically only track a single marker, 
and therefore do not work for multi-genic diseases, which are now believed to 
predominate in all undiscovered disease genes. In addition, both approaches 
generally use complex statistical process to compensate for spurious correlations that 
can occur due to population stratification and other unknown and non-random genetic 
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variation across the genetic samples studied, which.almost always requires samples fiom. . 
related individuals. 

Both of these types of problems could be obviated by using genome wide 
maps of (polymorphic) genetic markers* If all possible polymorphic genetic markers are 
5 available across a large enough set of samples, it is easy to statistically compensate for 
spurious correlations by randomly sampling large numbers of markers that most likely 
are not related to the disease of interest In addition, it is possible to locate all genes 
involved in multi-genic diseases. The estimate of a statistically sufficient sample size for 
this problem remains elusive as it depends on the complexity of multi-genic disease with 

10 an unknown structure. 

The down side is that the cost of genome wide maps of polymorphic 
markers is very high even in the current post-genomic era. For example, the most 
common polymorphic marker, a SNP, is expected to cost about 50 cents per marker in the 
near future. However, there are estimated to be 10 million such markers over the entire 

IS human genome, and a realistic association study would require at least 1000 samples to 
be tested for each of such 1 0 million SNPs. Fortunately, recent results show the presence 
of significant linkage disequilibrium, as described in David Altshuler et aL, 'The 
Structure of Haplotype Blocks in the Human Genome" , Science, 296, June 2002, the 
disclosure of which is incorporated herein by reference, suggesting that the human 

20 genome can be broken into haplotype blocks of average size of 30Kb, with all 

polymorphic markers within a single haplotype block being nearly 100% correlated with 
each other. In addition each such haplotype block appears to have an average of only S 
alleles (genetic variations). Thus, on average, 3 carefully selected SNPs should be 
enough to identify all genetic variation within each haplotype block, and hence testing for 

25 about 300,000 carefully selected SNPs should be enough to identify all genetic variation 
in a single DNA sample. Thus, the cost of genome wide maps of polymorphic markers is 
significantly reduced. One small inconvenience of linkage disequilibrium is that it is not 
possible to narrow down the location of the diseases causing gene any more closely than 
identifying the haplotype block in which it is located. 

30 One problem with attempting to exploit linkage disequilibrium is that in 

order to preserve all genetic information the genome wide map must distinguish the two 
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_ paraxial DNA strands jWhesaippJe. (except, of course, for the .Y_chrQraosome)^ so thaL 
the allele of each parental DNA strand of each haplotype block can be uniquely 
identified Such a genome wide map is referred to as a haplotype map (or haplotype 
block map), and would likely be two maps per chromosome, except for the Y 
5 chromosome. Unfortunately, the most inexpensive SNP genotyping process, whether 
using assays or array hybridization, do not track the phasing between neighboring SNPs. 
For a genotyping process to be able to track phasing between neighboring polymorphic 
markers, it should ultimately be able to test single DNA fragments containing 2 or moxe 
polymorphic markers in a single test or needs to simultaneously test groups of related 

10 DNA samples (e,g. trios of father-mother-child) to distinguish the parental alleles which 
would increase the total cost of the association study, as well as reducing the applicability 
for patients that do not have parental DNA available for analysis. 

SUMMARY OF THE INVENTION 
The present invention uses single molecule maps, such as generated by 

1 S Optical Mapping, and is generally based on statistically combining single molecule 

restriction maps of long genomic DNAs of average length of about 1 Mb; such a segment 
in human typically contain more than 2 heterozygous polymorphic markers. Thus, it is 
possible according to the present invention to combine this raw optical mapping data into 
genome wide haplotype restriction maps. In addition to being able to generate genome 

20 wide haplotype restriction maps, the exemplary embodiment of the system, process and 
software arrangement according to the present invention has two additional advantages 
over SNP based approaches. First, restriction maps can reveal not only SNPs that 
coincide with the restriction sites, but also other polymorphisms such as micro-insertions 
and deletions, global rearrangements orhemizygous deletions. Second, since single 

25 DNA molecule segments can be mapped using fluorescent microscopy, the exemplary 
approach is capable of very high throughput (limited primarily by the digital camera 
throughput) using very little DNA, and having a fraction of the comparable cost for the 
least expensive SNP approaches. The commercial cost estimated by the end of 2003 is 
the equivalent of 2 cents per (phased) genetic marker, and such cost is expected to drop 

30 by at least another order of magnitude as faster/cheaper computers and digital cameras 
become available over time. 
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. ... The iaw single moleeulemap data -can. consist of approximate restrictions 
maps of random pieces or segments of genomic DNA with average length of currently 
about 1-3 Mb. Bach approximate map may be derived from a single such segment of 
uncloned DNA molecule, directly derived from a blood sample. The map is approximate 
5 in that it has a number of errors, including sizing errors in the measurement of fragment 
size or distance between the restriction sites (typically 10% for a 30Kb fragment for 
Optical Mapping), missing restriction sites (typically 20% of restriction sites are false 
negatives), false restriction sites (typically 10% of restriction sites are false positives), 
and missing small fragments (typically most fragments under 1Kb are missing). 

10 Algorithms to assemble such approximate maps into larger and highly accurate maps 
using redundant data (SOX is typically sufficient) have been used successfully to 
construct genome wide (non-haplotype) restriction maps of micro-organisms such as E. 
Coli and P. Falciparum as well as BAC clones of human DNA, as described in Lim A, 
Dimalanta ET, Potamousis KD, Yen G, Apodoca J, Tao C, Lin J, Qi R-, Skiadas J, 

15 Ramanafhan A, Perna NT, Plunkett G 3rd, Burland V, Man B, Hacket J, Blattner FR, 
Anantharaman TS, Mishra B, Schwartz DC. "Shotgun optical maps of the whole 
Escerichia coli 0157:H7 genome", Genome Research, 11(9): 1584-93, Sep 2001; 
Giacalone J, Delobette S, Gibaja V, Ni L, Skiadas Y, QiR, Edington J, Lai Z, Gebauer D, 
Zhao H, Anantharaman T, Mishra B, Brown LG, Saxena R, Page DC, Schwartz DC. 

20 "Optical mapping of BAC clones from the human Y chromosome DAZ locus," Genome 
Research, 10(9); 1421-9, Sep 2000 and Lai Z, Jing J, Aston C. Clarke V, Apodaca J, 
Dimalanta ET, Carucci DJ, Gardner MJ, Mishra B, Anantharaman TS, Paxia S, Hoffinan 
SL, Venter JC f HuJffEJ; Schwartz DC. "A Shotgun Sequence-Ready Optical Map of the 
Whole Plasmodium Falciparum Genome," Nature Genetics, 23 (3): 309-313, Nov 1999 

25 andBud Mishra and Laxmi Parida, 'Tartitioning K clones : Ihapproximability Results and 
a Practical Solution to the K-Populatians Problem", RECOMB98 pages 192-201, 1998, 
the entire disclosures of which are incorporated herein by reference. The algorithms used 
can be based on Maximum Likelihood scoring using a Bayesian prior, as disclose in 
Anantharaman T, Mishra B, and Schwartz DC, "Genomics via Optical Mapping II: 

30 Ordered Restriction Maps," Journal of Computational Biology, 4(2):91-118, Summer 
1999, the disclosure of which is incorporated herein by reference. Similar to other 
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.gengmis mapping techniques, these algorithms construct only a single consensus map Sox - 
each human chromosome pair. 

The system, process and software arrangement according to the present 
invention can use any ordered maps of small pieces of DNA from the Genome, provided 
5 the markers are polymorphic and the error rates are within the bounds listed in the claims, 
e.g., data generated by Optical Mapping* This invention can then be used to construct 
genome wide haplotype maps from any single molecule mapping data and then applied to 
large-scale association studies to locate the genes responsible for specific genetic 
diseases. 

1 0 Optical Mapping, as described in International Application No. 

PCT/US01/30426, the entire disclosure is incorporated herein by reference, can be used 
to generate approximate restrictions maps of pieces of single DNA molecules at very low 
cost and high throughput Uncloned DNA (e.g., directly extracted flom a blood sample) 
can be randomly sheered into 1-2 mega base pieces and attached to a suitable substrate, 

1 5 where it is first reacted with the restriction enzyme, then stained with a suitable 

fluorescent dye. The restriction enzyme cleavage sites show up as breakages in the DNA 
under fluorescent microscope. Tiled images of the surface may be collected 
automatically using a fluorescent microscope with a computer controlled x-y-z sample 
translation stage. The images are analyzed automatically by a computer to detect the 

20 bright DNA molecules and to locate the breaks in these molecules corresponding to the 
restriction enzyme cleavage sites. The approximate size of the distance between 
restriction sites can be estimated based on the integrated fluorescent intensity relative to 
that of a standard DNA fragment (typically some small cloned piece of DNA, for 
example some Lambda Phage Clones) that has been added to the sample. The software 

25 arrangement by the computer uses the known length and restriction map of the standard 
to recognize it in the data. Errors can be introduced by die physical process, such as non- 
uniform staining, failure of restriction enzyme to cleave, random breakage in the DNA 
molecule that cannot be distinguished from a cleavage site, and errors in the image 
processing that may introduce additional cleavage sites (due to non-uniform staining) or 

30 miss some cleavage sites that produce very small gaps, or accidentally combine two DNA 
pieces into a single larger piece. These errors include, e.g., sizing errors in the 
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measurement offragment size Qr distance between rpstriction sites.(typically L0%for a _ _ 



30Kb fra 



it), missing restriction sites (typically 20% of restriction sites are false 
negatives), false restriction sites (typically 10% of restriction sites are false positives), 
and missing small fragments (typically most fragments under 1 Kb are missing). Optical 
Mapping relies on redundant data to recover from errors. Approximately 5 Ox redundancy 
is preferred to assemble genome wide maps and recover from most errors (except for a 
residual sizing error) with high confidence. 

A single restriction map generally detects only a limited number of 
polymorphic markers, namely those SNPs that coincide with the restriction site and 
insertions/deletions that are large enough to result in significant changes in the distance 
between restriction sites. The system, process and software arrangement according to the 
present invention overcomes this limitation, since even considering SNPs alone, enough 
coincide with restriction sites, that a small number (2-10) of restriction maps maybe 
sufficient to identify the alleles of most haplotype blocks, and thus contain at least as 
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1 5 much information as about 3 



(phased) SNPs. 

The exemplary embodiment of the present invention relates to systems, 
process and software arrangements for producing genome wide haplotyped maps. More 
particularly, the present invention relates to systems, process and software arrangements 
for producing genome wide haplotyped maps from single molecule based approximate 
ordered maps and locating genes responsible for genetic diseases. 

Other and further objects, features and advantages of the present invention 
will be readily apparent to those skilled in the art upon a reading of the description of 



preferred embodiments which follows. 

DETAILED DESCRIPTION 

Hie present invention relates to systems, process and software 
arrangements for producing genome wide haplotyped maps. More particularly, the 
present invention relates to systems, process and software arrangements for producing 
genome wide haplotyped maps from single molecule based approximate ordered maps 
and locating genes responsible for genetic diseases. 

The prevalence of SNPs that coincide with restriction sites can be 
estimated quite reliably by examining the set of known SNPs and ft>r each possible 
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. .ratridion enzyme detennhmg if there is a.restription jite. at.the.SNP location that would 
not cut for one of the SNP variants. Such a SNPs site can be referred to as a polymorphic 
restriction site relative to the restriction enzyme considered. The number of such 
polymorphic restriction sites for each of the 269 distinct restriction enzymes is shown in 

5 Table 1 for a selected subset of restriction enzymes (under the column "Poly Sites")- 
Additional columns adjust the raw number to account for the unknown SNPs that have 
not yet been detected, but would show up in a restriction map. The last column of the 
tables assumes that the total number of SNPs is 10 million and is linearly extrapolated 
from the number of polymorphic restriction sites and known SNPs (1 .28 million). In 

0 addition, some of the polymorphic restriction sites may not be detected by Optical 
mapping since they are too close to another restriction site to be resolved by Optical 
mapping. It can be assumed that any polymorphic restriction site within 400 base pairs of 
another restriction site should not be detected and estimated the fraction of restriction 
sites that may be lost on average by examining the distribution of restriction fragment 

S sizes from the sequence of chromosome 21 published by NIH (as shown in column 
5 "Miss-rate" in Table 1) and extrapolating this rate to the entire human genome. The last 
column of Table 1 reflects such adjustment Optical Mapping generally works on human 
DNA if a methylation insensitive restriction enzyme is used. There are 8 such known 
restriction enzymes, which are shown in Table 1 marked with an asterix under the 

0 1 "Methyl" column along with a small selection of other restriction enzymes. The ability 
to use any particular restriction enzyme can be further restricted by the smallest fragment 
size the Optical mapping can size reliably. This currently may limit Optical Mapping to 
restriction enzymes that produce average fragment sizes of 15Kb or more, and limit the 
use of the last two restriction enzymes in Table 1 (e.g., Pad and Swal). However, the 

5 sizing accuracy would improve sufficiently to allow maps with average fragment size of 
about 2.0 Kb to be generated. This would allow the use of any of the last six methylation 
insensitive restriction enzymes shown in Table 1 . One shows how much overlap there is 
between the information provided by different restriction enzymes. Preferably, if there is 
no overlap, it is possible to simply add the numbers in the last column of Table 1 to 

D estimate the total number of SNPs that can be detected by using multiple restriction 
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enzymes. In this case, .the ^..m^y^onjngensi^ enzymes.that are usable- - 

by Optical Mapping can detect approximately 200,000 SNPs. 
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Table 1; restriction sites coinciding with SNPs 

In one exemplary embodiment of the present invention, algorithms can be 
5 used to assemble genome wide haplotype maps from Optical mapping data. The map 
assembly algorithms used to assemble non-haplotype maps from Optical Mapping data 
may be based on Bayesian/Maximum-Likelihood estimation, as disclosed in 
Anaatharaman TS, Mishra B, and Schwartz D.C., "Genomics via Optical Mapping HI: 
Contiging Genomic DNA and variations/' ISMB99, 7: 18-27, Aug 1999, the disclosure of 

10 which is incorporated herein by reference. The systems, processes and software 

arrangements of the present invention for assembling haplotype maps from, e.g., Optical 
Mapping data, extend these algorithms to handle a mixture hypothesis of pairs of maps 
for each chromosome, corresponding to the correct restriction map of the two parental 
chromosomes, and each single-molecule Optical Map can be assumed to have been 

1 5 derived from one of these two hypothesis maps at random. For the sake of simplicity, it 
can be assumed that all data is derived from a single chromosome so that only one pair of 
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_ hypqtiierisjnj^s, e.g., HI and H2_arause&. .The general casamay beatriviai extension - 
of this special case. It is then possible to use a probabilistic model of the errors in the 

Optical Maps to derive conditional probability density expression f(p \ H t ) and 

f{D | H 2 ) that any particular approximate restriction map D is derived with errors, and 

5 some suitable breakage from correct chromosome maps HI and H2. The goal is to 

compare different possible HI and H2 to find the best ones. Hence, it is possible to apply 
Bayes rule, Equation (0.1) (with M number of approximate restriction maps in the input 
data): 

(0.1) /(iTpFJA--^ 

10 The first term on the right side is the prior probability of any hypothesized 

chromosome maps HI and H2. Generally, no prior information is available except that 
the average restriction fragment size is typically approximately known, and it is known 
that HI and H2 will be very similar. Polymorphic restriction sites are typically rare 
around 4% (see last column in Table 1), but can range from 27% (Bpul831I) to 1.8% (for 

IS SdiT) of all restriction sites, depending on the restriction enzyme involved, and can be 
estimated quite reliably for the restriction enzyme used from the full version of Table 1 . 
For restriction fragment length polymorphisms (RFLPs) there is difficulty estimating how 
frequently they occur, but it is always possible to estimate a probability (say 4%), and 
iterate the process if the final maps HI and H2 that maximize the probability density do 

20 not confirm this value. After the first haplotype map with a particular restriction enzyme 
has been constructed, reliable estimates should carry over to additional maps. Thus, 
establishing the expression for the prior term is usually possible, and can be further 
simplified to include only the low prior probability of polymorphic restriction sites or 
restriction fragment lengths with negligible loss in accuracy. 

25 For the conditional probability term, it can be assumed that each 

approximate restriction map (data input) is a statistically independent sample from the 
genome and that the associated mapping errors are independent, and that molecules were 
derived from either parental chromosome with equal likelihood. Hence, the following 
expressions can be obtained: 
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Thus, Hie conditional probability terms are reduced to combinations of the 
non-haplotype case / (Z) | H) involving just one hypothesized map at a time. This 

conditional term can be provided as a summation over all possible (e.g., mutually 
5 exclusive) alignments between the particular D and H, and for each alignment the 
probability density can be based on an enumeration of the map errors implied by the 
alignment. In order to obtain a reasonably fast evaluation of the probability densities 
summed over all alignments is the use of a dynamic programming recurrence equation, 
which is equivalent to factoring out the common sub-expressions of the probability 
10 densities across the different alignments. First, a single arbitrary alignment between a 
particular D and H should be considered. For the sake of convenience, the following 
discussion drops the subscript j from D and m). The data map D can be described by a 
vector of locations of restriction sites D [J = 0 • • * m + 1] s where for convenience the first 

entry D[o] is 0 and the last entry D [m + 1] is the total size of the map. For rotational 

15 convenience, it is possible to also refer to the entries of this array as 2> 7) J = 0- ■ ■ m +1 

which should not be confused with the distinct data maps D J9 j=l"*M referred to 

previously. Similarly, the hypothesis map H can be described by a vector 

jy[/ = 0-"i\T+l] also denoted as // /t J = <)••• JV+l. An arbitrary alignment can be 

provided as a list of pairs of restriction sites from H and D that describe which restriction 

20 site from H is aligned with which restriction site from D. According to the example 
shown in Figure 1, the alignment consists of 4 aligned pairs (4,2) (5,2) (IJ) and (P,Q). 
All restriction sites in H or D need be aligned. For example, between aligned pairs (I, J) 
and (P, Q), there is one misaligned site on H and D each, corresponding to a missing site 
(false-negative) and extra-site (felse-positive) m D. In this alignment, a true small 

25 fragment between sites 4 and 5 in H are missing from D, which is shown by aligning both 
sites 4 and 5 in H with the same site 2 in D. If two or more consecutive fragments in H 
are missing in D, this can be described by aligning all sites for the missing fragments in H 
with the same site in D (rather than showing only the outermost of this set of consecutive 
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sites jri H aligned with D, for example).. Thi$ convention proyides that for. each missing., 
fragment two consecutive sites in H (those flanking the missing fragment) can be aligned 
with the site in D in which the fragment is presumed missing. 

The expression for the conditional probability density of any alignment 
5 such as this can be provided as the product of a term corresponding to the region of 
alignment between each pair of aligned sites, plus one term for the unaligned region at 
each end of the alignment. For an aligned region that is not a missing fragment (e.g. (I, 
J) and (P, Q), such that P>I and Q>J), this probability density can be denoted by a 
function of the form FAjj^ , which may depend on the specific errors in the 

10 corresponding region of the alignment between D and H. Similarly for an aligned region 
that corresponds to a consecutive number of missing fragments, the probability density 
may be denoted by a function FM fJi (e.g. (I, J) and (1+1, J) can correspond to FM IJH ). 



For the probability density of the unaligned portion on the left and right end of each 
alignment, UR JpJ can be used on the right end if (I, J) is the rightmost aligned pair, and 

1 5 ULjj on the left end if (I, J) is the leftmost aligned pair. 

Their exact form does not affect the complexity of the system, process and 
software arrangement according to the present invention, as long as they can be evaluated 
in constant time. The form of these functions for a good Optical Mapping data model is 
shown in example 1 in equations (0.7X0.8) and (0.9). 

The probability density of a particular alignment is the product of each of 
the terms FA If3Pa , PM f , UL JgJ , UR U that apply to that alignment The probability 

density of any alignment can be separated into the product of those terms on either side of 
any particular alignment pair (I, J). This forms the basis of a two-dimensional recurrence 
using an array AR TJ , where 7 = l-"iV r ,/=0---w+l. AR frf represents the sum of the 



25 probability densities of all those alignments between the part of H to the right of site I, 
and the part of D to the rigjit of site J, for which (I, J) is the leftmost aligned pair. Thus, it 
is possible to derive the recurrence for AR U in Equation (0.3). 
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This arrayi;an_thmbe used to compute Jhe_ total probability density by. 
summing over every possible leftmost alignment pair (I, J) as shown in Equation (0.4). 

(0.4)/(2>|^) = XI]^/^ 

Equations (0.3X0.4) are able to sum up all of the alignments in time 

5 proportional to rrij N \ where m } is the number of restriction sites in Dj andNisthe 

number of restriction sites in H, If an acceptable a good approximate location of the best 
alignment between D and H is known, which is possible if the conditional density has 
been previously evaluated for a similar H or with the help of geometric hashing 
algorithms, a constant width band of the recurrence array AR JfJ should be evaluated 

10 which can be performed in time proportional to 2myA 3 , where A is the number of 

restriction sites representing the width of the band. A fixed value of A = 8 works well 
for error rates typical in Optical Mapping data. 

The computationally expensive part is the search over possible correct 
maps HI and H2. First, assuming that both HI and H2 is very similar, and a single 

15 hypothesis H that best matches all data can be reached far. This first stage is similar to 
the case of non-haplotype map assembly. Then the maps can be heuristically and quickly 
assembled into larger contigs using a similar and approximate dynamic programming 
scheme to obtain the best alignment between any two approximate maps D. If this 
alignment is good enough, the maps can be combined into a larger map (contig map) by 

20 averaging the two maps in their overlap region. This heuristic stage relies on geometric 
hashing to quickly identify the maps that overlap, and the complexity of this stage can be 

determined by the geometric hashing and is estimated to be approximately o{M D * n j 

hi 

where M D ^^Tm, is the total number of fragments in the Optical Mapping data. 

Geometric hashing can have sub-quadratic complexity in the worst case and the 
25 complexity may be as good as linear. The actual time for this state of computation is 
usually small compared to the time for the remaining search over possible HI and H2, 
unless the genome being used is much larger than the human genome. The resulting 
contig maps can be used as a basis for an initial hypothesis H, which should then be 
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refined by trying to add prjdelets .testriction sites jnd.hy adjusting the distance between- 
restriction sites by doing a gradient optimization of the probability density of all maps for 

each fragment size. The first two derivatives of f(D\H) with respect to any single 
fragment size can be computed by a recurrence similar to AR r ^ , by taking the derivatives 

5 of the recurrence equations applying the normal chain rule. Outlined below is an 

algorithm that can compute the derivative for all fragment sizes in a single step only 2-3 
times as expensive as doing so for a single fragment size. 

This initial search stage, which constructs a genotype map H, is then 
followed by an additional search in which HI and H2, initially the same as the best H, are 

1 0 gradually modified by attempting to introduce a restriction site polymorphism at each site 
in HI or H2 (and also at locations between them) as well as restriction fragment length 
polymorphisms (RFLPs) for each fragment in HI or H2 and evaluating the complete 
probability density using Equation (0. 1). Attempting each new restriction site 
polymorphism involves modifying HI or H2 by adding or deleting a restriction site from 

1 5 HI (or H2) only, while attempting an RFLP involves modifying the same interval in HI 
and H2 by adding some delta to HI and subtracting the same delta from H2. In each 
case, 2 possible orientations of each polymorphism are possibly, reversing the use of HI 
and H2 above. Both orientations should be tested and the better scoring orientations 
selected, except when adding the first polymorphism to HI and H2. In this maimer the 

20 correct phasing of neighboring polymorphisms can be detected in a natural manner 

whenever possible. If the data cannot allow the phasing to be determined because there 
may be insufficient or no data molecules spanning both polymorphisms, both phases 
(orientations) can have the same score . This fact is also recorded since it marks a break 
in the phasing of polymorphisms, and the interval between such breaks can be referred to 

25 as a "phase contig." RFLP polymorphisms are more expensive to score, since in addition 
to the orientation (whether HI or H2 has the bigger fragment) estimates are generally 
made regarding the value of delta (the amount of the fragment size difference for HI and 
H2), which can involve some form of trial and error procedure. 

By testing a preliminary implementation of the above algorithm on 

30 simulated data, a purely greedy addition of polymorphisms to HI and H2 can get lodged 
in local maxima when two or more actual polymorphisms are in a close vicinity. For 
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example, if the trueHUha* a. 1 O.Kb fragment ihllowediy a JL Kh fragment while the true - 
H2 has an 11 Kb fragment in the same location, the correct solution is to add a restriction 
site polymorphisms to the initial conlig map at the right end of the 10-1 1 Kb fragment. 
However, given the possibility of sizing errors and missing small fragments of 1Kb, it is 
5 also possible to score this as a RFLP (the 10Kb vs. 1 1Kb) and the 1 Kb fragment being 
missing in half the data. By attempting both cases before committing to a change in HI 
and H2, the restriction site polymorphism can score slightly higher than the RFLP. This 
can be implemented by using a heuristic look ahead distance of a certain number of 
restriction sites, and scoring all alternate possible polymorphisms within this distance of 

10 the best scoring one, before committing the best scoring polymorphism in HI and H2. In 
general, it is possible to score all possible pairs (or triples) of polymorphisms in a local 
region, which would increase the search cost . 

Simple heuristics can be used to significantly accelerate the evaluation of 
Equation (0.1). First HI and H2 are typically modified in a single location at a time. 

15 Data maps are typically only 1-2 Mega bases, while a complete chromosome map 
represented by HI or H2 can be much larger. If a data map D } did not previously 

overlap HI or H2 anywhere near the location being modified, the conditional probability 
density terms f(Dj \ H t ) , can be reused for that data map from the last time it was 

evaluated. This effectively makes the cost of re-evaluating Equation (0.1) for a local 
20 change proportional to the coverage depth times ntj the number of restriction sites per 

map, rather than M D the total number of restriction sites in all data maps. Since all 
restriction sites should be considered 2-3 times until it is assured that no further 
improvements to HI and H2 are possible, this makes the total cost of the search for the 
best HI and H2 proportional to (M D /C)mC = Mjn , where m is the average number of 
25 sites per data map D, and C is the coverage depth. Since this usually dominates 

cost for the initial map H, the total cost remains roughly proportional to the 

total number of restriction sites in the data. 

Second, for the case of restriction site polymorphism, it is possible to 
accelerate the program by another factor of 2 by avoiding evaluating both possible 
30 orientations separately. Referring to Equation (0.2), i.e., that each hypothesis HI or H2 
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canpccur injust.tKQ versig^fer my particular restriction site; either the.restriction site, 
is present or it is absent For example, if previously both HI and H2 have a restriction 
site present Equation (0.2) is reevaluated first with the restriction site deleted from just 
HI, next with the restriction site deleted from just H2. Since it is possible to remember 
5 the previous values of these terms (with the restriction site present in HI and H2), these 
terms can be recomputed and recoded with the restriction site absent in both HI and H2 
and then perform the inexpensive averaging operations twice by combining the 
appropriate probability density terms already computed. It is also possible to evaluate the 
case where neither HI nor H2 have a restriction site at almost no extra cost, which can be 

10 the best option as a result of other changes in HI and H2 nearby. 

Both of these simple heuristics provide significant acceleration, while the 
resulting program can currently take about 2 hours per Mega base to search over the 
possible space of HI and H2. In the next section describes improvements to these 
algorithms in order to provide the acceleration of 20-140x or more, in addition to ways to 

IS parallelize the algorithms for a 16 or 96 processor Linux cluster. 

In a further embodiment of the present invention, a system, process and 
software arrangement to accelerate a single processor performance of the haplotype map 
assembly can be provided. An algorithm used by the system process and software 
arrangement for assembling the map and locating and phasing all polymorphisms in time 

20 proportional to o{mJ^ where M D is the total number of fragments in all input 

maps and m is the average number of fragments per input map has been described 
above. The second term dominates the time complexity for a human genome, and is due 
to the evaluation of the probability density repeatedly for different assumed parental 
chromosome maps HI and H2. An algorithm is described that will drop the complexity 

25 of the second term to 0{M D ) . However, this algorithm has a constant overhead that we 

estimate at about 4-6x. Hence, the potential speed up is likely to be about w/4 to m/6, 
which with an average fragment size of 1 5kB and an average molecule size of 2MB is 
about 20-30x. For an average fragment size of 2kB, the acceleration is even greater at 
over 15 Ox, and the time now remains proportional to the total number of input fragments 
3 0 and hence the total time increases by a factor of 7.5x. 
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. . Thi.s..ex.emplary embodiment provides a-festway-taxe^evahiate / (£>]&} -when H has-- - 
been changed locally in just one place in any of the following ways: 

1 . Delete one of the existing restriction sites in H. 

2. Add a new restriction site at a specified location in H. 

5 3. Increase or decrease one of the fragments (restriction site intervals) in H by a 

specified amount 

4. The first and second derivative of / (D \ H) relative to any fragment size in H. 

In all of the above cases, e.g., the cost of all such evaluations of f{D \ H) 

(or its derivatives) for all restriction sites spanning the molecule D can be done in just 2-3 
1 0 times the time it previously took to do just one such evaluation at one restriction site. 
This will allow the evaluation of Equations (0.1) (0.2) for all possible changes over a 
window of 2m restriction sites in time that is just 4-6 times greater than the cost for 
testing a single change at a single restriction site previously. The extra factor of 2 is due 

to the fact that the number of molecules D for which f{D \ H) is recomputed roughly 

■ 

IS double . 

The first step is to compute a new recurrence array AL IrJ which represents 

the sum of the probability densities of all those alignments between the part of H to the 
left of site I and the part of D to the left of site J, for which (I, J) is the rightmost aligned 
pair. As previously discussed the corresponding recurrence equation can be derived as 
20 follows: 

(0.5) AL U = VL U + (/ <; 0?0 : \)FM^ U AL^ + £ £ AL PjQ FA F ^ 

This array is preferably the mirror image of AR U , this recurrence array 
can be used to compute f(D\H) using an Equation similar to Equation (0.4). However, 
one exemplary reason to compute both AR^ and AL f J is that if H is changed locally 
25 near some restriction site H K , this will not change AR rj for I <K or AL tJ for I>K. 
It is possible to use mainly the parts of AR ItJ and AL U that didn't change to compute 
/ (£> | H ) . Then, the additional cost can be limited to re-computing the parts of AR U 
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_ BJid_AL SJm near JNK. In additipn^.some of this .cost of recomputing can be amortized - 

over different values of K, if the effect of local changes at consecutive restriction sites 
H x is simultaneously checked. To express / (D | H) in terms of both AR U and AL U 

so that those recurrence terms that do not change if we change H near H K are used, the 
5 following formulations are used: 

f K (D\H) = Vi (Alignments with rigfrnost aligned / £ K) + 



Pr (Alignments with leftmost aligned / > K)+ 
(0.6) Pr (Alignments with a fragment spanning [H K , H K ^ ]) 

K m¥\ N m+1 



I 



f(* < ?1 : O^^M^^, +1 
Z Z Z ALtjFKjf&^a 



All instances y4R v and AL }J used in Equation (0.6) remain unchanged if 

the interval H K *** H K+l m H is changed. Only the non-recurrence terms 

FA IJPtQ , FM K K ^ 9 URj j , VLjj change, and die modified forms of these terms can be 

10 computed in approximately constant time. 

The exemplary algorithms for each of the 4 cases are described below. To 
summarize, the computation cost in each of these 4 cases turns out to be: 

1 . To delete a restriction site from H: Total cost 6mA 3 for m restriction sites, 

2. To change the size of a restriction fragment in H: Total cost 6;hA 3 for changing 
15 each of m+1 restriction fragments by the same increment A^ . 

3. To add a restriction site at any point in H: Total cost 6iwA 3 + 4FA 4 to add one 
restriction site at T arbitrary locations. Note that to add one restriction site within 
each fragment (T=m), the total cost is about A times more expensive than for the 
previous 2 cases, since it is not possible to amortize the cost associated with the 

20 unique location of each new restriction site. 
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. 4.~ . Ta compute the first two derivatives of f(p | ff) relative -to- each ofm+1- - 

fragment sizes : Total cost 8mA 3 (slightly higher than for first two cases since we 
have to compute 2 derivatives). 

The 2 nd case (i.e., changing the size of a restriction fragment), the result is 
5 limited to the case when each fragment is changed by the same amount A„ , otherwise 

the computation cost is 4mA 3 +47>»A 2 + 2TA 4 . A possible strategy for finding RFLPs is 
to first check each fragment using a standard small value of A^ and -A H to check if an 

RFLP exists. Most fragments do not exhibit any RFLP. For the small number that do, a 
search can be performed for the optimal A^ value, using T different A tf , values over all 

10 fragments exhibiting an RFLP may have a computation cost of 4mA 3 + 47>nA 2 + 2TA 4 if 
it is started from scratch or a cost of just 47>«A 3 +2JA 4 if the arrays AL tJ and ARjj are 

saved for each data molecule from the first phase. Fox example, if 2 fragments are 
polymorphic it is possible to iterate 2-3 times with 1^=20 (10 A H values per fragment), 

thereby reducing the uncertainty in the optimal value of A^ by a factor of 10 in each 
15 such iteration. 

In the 4* case computing the two derivatives for each fragment may not be 
enough. In particular, all fragment sizes should be updated using some approximation to 
Newton's process, and iterate this a few times (4-10 typically) to insure convergence. 
Since the diagonal of the Jacobian (2 nd Gradient) is computed, the result may be unstable 

20 and suitable step size scaling may be preferable to insure convergence. It is also possible 
to compute a few off diagonal terms of the Jacobian (e.g., the first off diagonal terms 
resulting in a tri-di agonal Jacobian matrix), if this will accelerate convergence. 

The 3 rd case prefers to use a systematic way to decide where a new 
restriction site should be added Possible strategies may be to attempt 3-5 uniformly 

25 spaced locations inside each existing fragment OR every location for which a data 

molecule currently has a misaligned restriction site. It may be difficult to pick optimal 
locations in this maimer, and therefore may miss the true location, unless an improvement 
is observed in the value of the total probability density and subsequently optimize the 
location by optimizing fragment sizes (4 th case). In still a further embodiment, a 5* type 
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of locaLmodification^ and 4 ft cases. For 

each proposed new restriction site Hie new probability density can be computed as well as 
its first two derivatives relative to the location of the new restriction site, and then use a 
quadratic extrapolation of the probability density to score the new restriction site. 

5 In still a further embodiment of the present invention, the above described 

algorithms can apply to each molecule D independently, and they may be executed on a 
parallel Linux cluster by having each processor work on a separate molecule. It is 
preferable that each processor's workload is as balanced as possible. Since not all 
molecules would be of equal size, it may not be possible to obtain exact results. 

10 However, since it is possible to obtain a good estimate of the computation time as a 
function of the molecule size (m in the previous section), known bin-packing heuristics 
can be used to divide the set of molecules into 16 (or 96) groups that have similar total 
computation cost. For large data sets (300,000 molecules will be typical for a human 
genome at 50x coverage), the load balancing can be better than 95%. The bandwidth 

1 5 used between processors can be quite low since the final probabilities for each molecule 
D and possible local changes to map H are communicated to the master processor 
responsible for deciding how to modify H. 

Figure 3 shows an exemplary flow chart of an exemplary embodiment of 
the process according to the present invention for producing at least one haplotyped 

20 genome wide map. This process can be performed by a processing device, such as for 
example a compute: that includes a microprocessor. The processing device receives data 
310, which can be, for example, Optical Mapping data. Then, in step 320, the processing 
device prepares chromosome maps associated with at least one chromosome. In step 
330, a conditional probability density expression can be determined using the Optical 

25 Mapping data. Then, in step 340, a portion of at least one haplotyped genome wide map 
may be produced. In step 350, the processing device determines whether all portions of 
the at least one haplotyped genome wide map have been produced. If not, in step 360, a 
next portion of at least one haplotyped genome wide map can be produced. If all portions 
have been produced, in step 370, the process stops. 
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10 



To fecilitat&a Jirfheomderstandmg of the present invention, the - - 
following example of some of the preferred embodiments are provided In no way do 
such example be read to limit the scope of the invention. 

Example 1 

According to a process according to one exemplary embodiment of the 
present invention will now be described, an alignment probability expressions is provided 
that correspond to a good error model for Optical Mapping data: 

{Q.T>FA IJiPa = X Q - J -\i-P i ) p -'- x P d G(D a -Dj.Hr-H^-P?'-"' ) 

(0.8) FAf ltP s P 9 ''-*i 
(0.9) 

r w+i 

£ Jaw* if J * « 

pB»rm, + ^ ^pH^H, ^ i)/ fog p v , if / = ^ and y = m + 1 

0, otherwise 



UR 



P/' +P, (P/' -l)/logP y , If /=1 and /= 0 

0, otherwise 



Where, 



[Dj,H t -H p ,H g -H p y 
^(P=0?l:0)G(D y , J fir / ) 



, /2o 



G B (d,fi,b) = j-{erf 



d-h+b 



h-d 



^cr^maxfA-t.min^^J (^2max(/j-&,min(<z,A)) I 
Where P d is the digest rate, and hence (l -P, ) is the missing restriction 

site rate, A. is the false-positive site rate (sites per Mega base for example), o*h is the 
Gaussian sizing error variance for a fragment of size h, and P v is the probability mat a 
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fragment of unit size will.be nrissing-inJhe data, and Ji £ is the breakage rate-of the 

original DNA (the inverse of the average size of the DNA maps D). A C-style notation 
(conditioril\\Q) is used before a term that should be present if condition is true. 

Although it does not appear that UR f J or UL U can be computed in 

5 constant time, and likely 3-7 of the terms FR 1JtPQ or FL /JPQ (which can each be 

computed in constant time) are significant, these significant terms are stable and can be 
determined during an initial pass, and updated periodically as H1/H2 change. Equation 
(0.9) is provided under the assumption that each end of H is the end of a chromosome: 
the equations are likely simpler if H is an incomplete chromosome. 
1 0 Next a detailed description of processes for handling each of the four types 

of local modifications to the true map hypothesis H are described. In particular, 

1. Delete an existing restriction site from H. 

2. Add a new restriction site at a specified location in EL 

3 . Increase or decrease one of the fragments (restriction site intervals) in H by a 
IS specified amount 

4. The first and second derivative of f(D \ H) relative to any fragment in H. 

First, described below is a way to re-compute / (D \ H ) , while deleting 

one restriction site H K from H at a time for all possible K(l£K<, N). 

The first step is to derive an equation for f{D \ H) that uses only those 

20 parts of AR lvJ and AL SJ that will not change when H K is deleted, while excluding the 

probability for alignments that align with H K : 

f K (D\H) = Pr (Alignments with righmost aligned / < K)+ 

Pr( Alignments with leftmost aligned I > K) + 
(0,10) Pr(Alignments with a fragment spanning [H x ^ l9 H K ^ 



=ZZ^,/^ + Z 

/si j*o j=k+i y«=o 

\K < Ntl : 0)AL K _ u FM KW AR M +1 



tt+1 

X 



Z Z Z a^w^ 
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In.Equatioii-(0.1X)),-non^of the terras ARj^-ox AL U -change if the- - 
restriction site H K is removed from H. However, the tenns FA UtPtQ and UR IgJ and 
UL U change if H K is deleted. Referring to Equations (0.7X0.8X0.9), the change to 



FAijfG is simple, and involves a dropped term of, (l - P d ) . To see the effect on UR^ 
and UL U these are expanded in terms of FR IrJtPQ and FL lrIJt Q according to Equation 



(0.9) 











I 







lifted to obtain: 



/-I 



JV+1 N flH-1 



(0.11) 



w+1 



(* < : 0) AL K _ u FM x _^AR KUtJ +) 



K-l If ffl+1 



SEE 



Equation (0, 1 1) can then be modified to reflect the deletion of H K from H 
and corresponding changes in FA IJtPjQ , FR UJ9 ^ and FL U PiQ to obtain the following: 



£-1 w AT «+l 7-1 

m+l 



AT «+l 

si 



-z 



2ST-1 N »+l 



where, 



KJJJ 



0 



0 

<S>.\2) FLD w s \fl ijp 



iOT = .P-l 

if*:>p 
ifis:</' 



P+2 



ifA>.P+l 

As previously described, a small number ( A <, 8 ) of significant FRD or 



HIM 



r 




i 





arrays indexed by I and J is needed to be evaluated. Hence, the computation time of the 
first two summations in Equation (0.12) is approximately 2mA 2 , while the time for the 
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third-summation m Equation (0.-^ 

the original computation time of 2mA 3 , however the improvement can be greater if the 
equation is evaluated for all possible K(l£K<N\ since in that case the innermost 
terms FA IJtPXJ , FR UPQ and FL ItJPQ for different K are similar and may be evaluated 

only once. For example, any term in the third summation is likely tine same for all K s.t 
I<K<P and absent for all other YL Thus all possible terms in the third summation can 
be computed in a single pass, and each term added to the probability sum of a number of 
results f(D | H-H K ) for J <K<P. For each term, this can be done in constant time 

regardless of the range of possible K, an array of the differences of / (D | H - H K ) is 

computed for consecutive K, and each term is add at the start of the K-range and subtract 
at the end of the K-range in the array of differences. From the array of differences, the 

individual f(D\H-H K ) can be recovered at a later point Similar argument applies to 

the terms in the first two summations of Equation (0. 12), but each of the four variants 
involved should be computed and added to the corresponding four K ranges, which may 
only takes a constant amount of time. Thus, the overall time to evaluate f{D\H-H K ) 

for all K is approximately 2mA 3 plus the cost to pre-compute AL U and AR U , which are 

each also 2mA 3 . Thus, the total cost to compute all f(D \ H -H K ) is likely at most 3 

times the cost to compute just AR IJ9 hence it is possible to compute f(D\ H -H K ) for 

all K for just 3 times, and the cost to compute it for a single K. If enough memory is 
available in the computer executing this process or other memory is available, the cost of 
computing the complex terras FA ItJtPQ , FR J JJ9 Q and FL IvJ ^ can be shared between 

Equations (0.12)(0.5) and (0.3), which can reduce the total cost to perhaps just 2 times 
the cost to compute f(D\H~H K ) for a single K. 

The equivalent of the final Equation (0.12) for each of the remaining three 
local modifications to H is described below. 

Equation (0.13) shows the result for adding a restriction site at H T to EL 
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CQ43) . 



«+l 



ro+1 

where, 

Xff Tj a AR K _ X j (H K ^ -*H T ) see recurrence for -4i? /y 



/=ff /to 



/=0 



see recurrance for 



V 



FL4 



"'/,/ p l» 1 *H 

The notation (#* -±H T ) means to evaluate -41^ (using its 
defining equation provided previously) while replacing any occurrence of H K with H T . 
5 Equation (0. 13) preferably depends on the exact value of H T , e.g., only in the last 

summation term. All other summations can be done simultaneously for all K 
(\<iK<>N+\)a& described hereinabove for Equation (0.12). Only the last summation is 
preferably performed separately for each specific value of H T . Hence, the total 

computation time is preferably proportional to 6mA 2 + 4TA A , where T is the number of 
10 different values of H T for which / {D \ H+H T ) is computed 

Equation (0.14) shows the result for increasing one fragment size between 
H K and H KM by a specified amount A„ . 
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10.14) . 

AD Wi.T 



K m Af+l AT if +1 7-4 

f{D\H:H T ->H T +A }f VT>K) = YT l S AL lrl FRD KitJ ^ £ ZiZ»AX tJ FlD l 



IT N ja+1 ^ 



where, 

fad im s ^« ( * / ~> H p + A * ) 

[^(^i^ffM+A^ff^ffp+A,) if*<P-l 

^/^.iM -> + A* ) if * - P-1 

If the same value of A„ is used for all fragments, it is possible to evaluate 

5 Equation (0,14) for aU possible K(l<£ <//+l) in time proportional to 6mA 3 ,inthe 
same manner as described above for Equation (0. 12). On the other hand, if each L H 

value is different, Equation (0. 14) may be evaluated separately for each one at a cost 
proportional to 4mA 2 + 2A 4 . To this result the costs of pre-computing AR T J and AL U 

of 2mA 3 should be added. Hence, the total cost for T unrelated A^ values can be 

10 proportional to 4mA 3 + j(4mA 2 + 2A 4 ) . It may be possible to reduce this result to be to 

close to 4mA 3 + 2TA 4 , since most of the terms in the first two summations in 
Equation(0.I4) are likely to be negligible, except for a few terms when K is close to 
either end of H. This is the case for Equations (0J2) and (0.13) as well, while the cost of 
evaluating the terms of the first two summations are likely significant only in the current 
15 case, and even then likely only if m t A 2 . 

In order to compute the first two (d=l,2) derivatives of f(D j H) relative 
to all fragment sizes F K s ^ -H K ,0<>K<>N , Equation is (0.15) can be used 
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®F K ;«u-om+i /ejr+iJoip-o 3F.r 



d'FMR 



d*FML 



(0.15) +(* = : 0)^ +1 + (tf = 0 71 : 0)^ ^ 

at N oF 0 



f d d FM 

+£ \{K<Nn:0)AL Kr/ AR Mrf - 

/-o I 



S^+Z Z Z v^ 5 ^ 



The differential expressions in Equation (0. IS) can be computed as shown 
in Equations (0.16)(0.17)(0.18)(0.19X0.20)(0.21)(0.22) and (0.23). 



dF, 



d FMR 
&FML 



dF, 



d 



=FM NtlHl (R t +losP v )(]DgP v y- 1 
-iV v (^+]agP v )(bgi>) M 



dFR 



dF. 



dFL 



dF r 



d*FR^ 



(0.16) 



dF* 



d i FL 
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ifX^P 
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ifK<P 








if K = P 








if K>P 








ifK<P-l 








ifK = P-l 
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if K<P 








if K=P 
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FL4„. 




ifK>P 
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Example 2 

An application of one exemplary eml 



Lent of the present invention to a 
simulated data set is described below. For this exemplary embodiment, the basic map 
assembly algorithms is preferably extended by adding a post processing phase to 
carefully examine the component input maps that go into each consensus map, assign 
each input map to one of two populations and reassemble them into two separate 
10 consensus maps. This implementation uses simulated data to allow the performance for 
data error rates greater than present in actual data to be determined. 

To generate simulated data the first 5 megabases of human chromosome 
21 published by NIH can be used, and an in-silico restriction map may be generated for 
the restriction enzyme PacI, and then random errors are repeatedly introduced into this 
15 restriction map using the error rates described above and selected a random piece of 
between 1 .5 and 2.5 Megabases. This set of simulated data can represents one parental 
copy of chromosome 21. In order to generate the set for the other chromosome, the 5 Mb 
sequence can be randomly modified by inserting a random base modification to gjmvlatft 
SNPs and random insertions and deletions of about 3Kb (the current sizing error averages 
20 3Kb per 30Kb average restriction fragment, hence smaller insertions/deletions would 
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lfte!y.bt<Uffi^^ that.the number &£ SNPs that coincide with restriction • 

sites is approximately the same as the number of insertions and deletions. Such modified 
sequence can then be used to generate the second set of simulated maps, which 
correspond to the second parental copy of chromosome 21. The two sets of data may be 

5 combined in a 1 : 1 ratio and mixed together randomly. 

The system, process and software arrangement according to the present 
invention can generate, e.g., 2 consensus maps and assign input maps to either of these 
consensus maps or can leave them unassigned. The accuracy of the results can be scored 
by comparing them with the true in-silico maps (generated along with the simulated 

10 data). This procedure can be repeated for different amounts of simulated data 

corresponding to data redundancy of 6x, 12x, 16x, 24x and 5 Ox. Such redundancy can be 
measured per haplotype, and thus, the results for 6x redundancy generally corresponds to 
6x2x5 Mb of simulated data or 30 molecules of average size 2 Mb. Hie exemplary 
results are summarized in Table 2. To further understand these results row 4 (I6x 

15 Redundancy) can be reviewed. The last column shows that 80 molecules have been in 
the simulation. Of these molecules 71 molecules have been stated to be classified as 
belonging to one of the two phases (or haplotype variants). 2 errors were made and only 
69 molecules have been correctly classified. By comparing the two consensus maps 
generated by the software, a list of restriction sites classified as polymorphic (Le. a SNP 

20 was claimed by the software to exists at a restriction site) , has been generated and this 
list was then compared to the correct list of SNPs generated from the true in-silico maps. 
The column with the header "fp SNPs" shows the number of generated false-positive 
SNPs (i.e. extra incorrect SNPs) and, in this case the number is 2. The column with the 
header "fii SNPs" shows the corresponding number of false-negative SNPs (i,e, SNPs 

25 missed by the software), and in this case the number is 1 . Similarly for KFLPs (i.e. 
fragment size polymorphisms due to the simulated insertions/deletions), the numbers of 
false-positives is 0 and false-negatives is 12. The total numbers of correct SNPs and 
RFLPs are 16 and 24, respectively. 



Redundancy 


fp SNPs 


ftiSNPs 


fp RFLPs 


fh RFLPs 


Phase err 


Molecules 


6x 


5 


5 


1 


18 


7/26 


30 


12x 


4 


2 


4 


16 


2/55 


60 
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16x . . 


.2. . .__ 






.12 


-2/JZ1 


.-80- 


24x 


2 


1 


1 


11 


3/111 


120 


50x 


0 


1 


t 


5 


4/228 


250 


lOOx 


0 


0 


2 


1 


2/441 


500 



Table 2 ; Haplo typing algorithm performance for 16 SNPs and 24 RFLPs 



Exemplary statistics of errors in Haplotype maps is shown in Figure 4. 
These exemplary results show the system process and software arrangement according to 
the present invention can advantageously classify molecules to the right phase 
5 (haplotype) whenever redundancy was 12x or higher. However, to detect all the SNPs 
and RFLPs in the data additional redundancy may be used For example, at least 16-24x 
redundancy should be used to achieve 80% or more accuracy finding SNPs, and 5 Ox 
redundancy to achieve similar accuracy finding RFLPs. These results indicate that with 
50x data redundancy, it is possible to reliably detect most SNPs and over 80% of RFLPs 

1 0 for insertions/deletions equal to 1 standard deviation of the sizing error (currently 3Kb). 
The accuracy for larger insertions/deletions would Ekely be higher. 

Therefore, the exemplary embodiment of the system process and software 
arrangement according to the present invention is well-adapted to carry out the objects 
and attain the ends and advantages mentioned as well as those which are inherent therein. 

IS While the invention has been depicted, described, and is defined by reference to 
exemplary embodiments of the invention, such a reference does not imply a limitation on 
die invention, and no such limitation is to be inferred. The invention is capable of 
considerable modification, alteration, and equivalents in form and function, as will occur 
to those ordinarily skilled in the pertinent arts and having the benefit of tins disclosure. 

20 The depicted and described embodiments of the invention are exemplary only, and are 
not exhaustive of the scope of the invention. Consequently, the invention is intended to 
be limited only by the spirit and scope of the appended claims, giving full cognizance to 
equivalence in all respects. 
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What is claimed is: 

1 . A process for producing at least one haplotyped genome wide map, 
comprising the steps of: 

(a) preparing chromosome maps associated with at least one 
5 chromosome; and 

(b) producing a portion of at least one haplotyped genome wide map 
based on the chromosome maps. 

2. The process of claim 1, wherein the portion of at least one 
haplotyped genome wide map comprises at least one restriction site, 

10 3. The process of claim 1, wherein less than all subparts of the 

haplotyped genome wide map are produced in step (b) as ordered or unordered sets of 
contigs. 

4. The process of claim 1, wherein the chromosome maps are based 
on at least one single molecule map data set. 



15 5. The process of claim 1, wherein the haplotyped genome wide 

comprises two maps per chromosome associated with the at least one single molecule 
map data set 



6, The process of claim 4, wherein the at least one single molecule 
map data set has error rates as great as or smaller than: (i) about 10% error in distance 
20 between sites, (ii) about 20% missing sites, (Hi) about 7% false sites and (iv) about 50% 
of sites closer than about 1 kB apart that are approximately indistinguishable. 



7. The process of claim 4, wherein the at least one single molecule 



r 


m 


• 


El 



II nil 



s 



of polymorphic markers comprising at least one of restriction site polymorphisms, 
25 restriction length polymorphisms, insertions of bases, deletions of bases, and single 
nucleotide polymorphisms (SNPs). 
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„8 The. process, otclaim 4, wherein_the_at least xmejsinglejnolecule. 

data set combines at least two single molecule map data sets comprising different 
restriction site markers combined into a single haplo typed genome wide map, wherein all 
restriction site markers are combined and wherein the restriction site markers are 



5 distinguishable. 

9. The process of claim 1, further comprising the step of determining 
a conditional probability density expression based on the chromosome maps associated 
with at least one chromosome. 

1 0. The process of claim 9, wherein the probability density expression 
10 is based on errors provided in at least one single molecule map data set 

1 1 . The process of claim 1, further comprising the step of detecting 
substantially all site-based polymorphisms in the at least one haplotyped genome wide 
map. 



1 2. The process of claim 1 , further comprising the step of detecting 
IS substantially all interval-based polymorphisms in the at least one haplotyped genome 

wide map. 

13. The process of claim 1, wherein steps (a) and (b) are performed 
within a particular time, and wherein die particular time is a sub-quadratic function of a 
number of sites associated with an input data. 

20 14. A process for performing disease gene association based on at least 

one haplotyped genome wide map, comprising the steps of: 



producing at least one haplotyped genome wide map per patient; 

and 

performing the disease gene association based on the produced at 
25 least one haplotyped genome wide map. 

15. The process of claim 14, wherein the step of producing the at least 
one haplotyped genome wide map, comprises the sub steps of: 
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25 



chromosome; and 



(b) producing a portion of the at least one haplotyped genome wide 
p based on the chromosome maps. 



5 1 6. The process of claim 1 4, wherein the portion of at least one 

haplotyped genome wide map comprises at least one restriction site. 

17. The process of claim 14, wherein less than all subparts of the 
haplotyped genome wide map are produced in sub step (b) as ordered or unordered sets 
ofcontigs. 

10 18. The process of claim 14, wherein the chromosome maps are based 

on at least one single molecule map data set. 

19. The process of claim 14, wherein the haplotyped genome wide 
map comprises two maps per chromosome associated with the at least one single 
molecule map data set 

15 20. The process of claim 1 8, wherein the at least one single molecule 

map data set has error rates as great as or smaller than: (i) about 10% error in distance 
between sites, (ii) about 20% missing sites, (iii) about 7% felse sites and Qv) about 50% 
of sites closer than about 1 kB apart that are approximately indistinguishable. 

21 . The process of claim 1 8, wherein the at least one single molecule 
20 map data set consists of either Optical Mapping data or any single molecule ordered maps 
of polymorphic markers comprising at least one of: restriction site polymorphism, 
restriction length polymorphisms, insertions of bases, deletions of bases, and single 
nucleotide polymorphisms (SNPs). 



22. The process of claim 18, wherein the at least one single molecule 



restriction site markers is assembled into a single haplotyped genome wide map, wherein 





m 


V 
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, all jestdstfoji .site, markers .are combined and wherein the.restriction-sita makers are- - 
distinguishable. 

23. The process of claim 14, further comprising the step of 
detennining a conditional probability density expression based on the chromosome maps 

5 associated with at least one chromosome. 

24. The process of claim 23, wherein the probability density 
expression is based on errors provided in at least one single molecule map data set 

25. The process of claim 14, further comprising the step of detecting 
substantially all site-based polymorphisms in the at least one haplotyped genome wide 

10 map. 



26. The process of claim 14, further comprising the step of detecting 
substantially all interval-based polymorphisms in the at least one haplotyped genome 
wide map. 



27. The process of claim 14, wherein steps (a) and (b) are performed 
15 within a particular time, and wherein the particular time is a sub-quadratic function of a 
number of sites associated with an input data. 



28. A process for producing at least one genotyped genome wide 
comprising the steps of: 

(a) preparing chromosome maps associated with at least one 
20 chromosome; and 



ii mi 



(b) producing a portion of at least one genotyped genome wide 
based on the chromosome maps. 

29. The process of claim 28, wherein the portion of at least one 
genotyped genome wide map comprises at least erne restriction site. 



34- 



WO 2004/046889 



PCT/US2003/037114 



J0. v _ . The process jof. claim 28,„whecemlesBJhan alLsubparts of the - 
genotyped genome wide map are produced in step (b) as ordered or unordered sets of 
contigs. 

3 1 . The process of claim 28, wherein the chromosome maps are based 
S on at least one single molecule map data set. 

32. The process of claim 28, wherein the genotyped genome wide map 
comprises two maps per chromosome is assembled from at least one single molecule map 
data set 



33. The process of claim 3 1, wherein the at least one single molecule 
10 map data set has error rates as great as or smaller than: (i) about 10% error in distance 

between sites, (ii) about 20% missing sites, (iii) about 1% false sites and (iv) about 50% 
of sites closer than about 1 kB apart that are approximately indistinguishable. 

34. The process of claim 3 1, wherein the at least one single molecule 
map data set consists of either Optical Mapping data or any single molecule ordered maps 

1 5 of polymorphic markers comprising at least one of restriction site polymorphisms, 
restriction length polymorphisms, insertions of bases, deletions of bases, and single 
nucleotide polymorphisms (SNPs). 

35 . The process of claim 3 1, wherein the at least one single molecule 
data set combines at least two single molecule map data sets comprising different 

20 restriction site markers is assembled into a single genotyped genome wide map, wherein 
all restriction site markers are combined and wherein the different restriction site markers 
are distinguishable. 

36. The process of claim 28, further comprising the step of 
determining a conditional probability density expression based on the chromosome maps 

25 associated with at least one chromosome. 

37. The process of claim 36, wherein the probability density 
expression is based on errors provided in at least one single molecule map data set 
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. 35, The process of clam28 > jfijrthencoinprising the step^of detecting 

substantially all site-based polymorphisms in the at least one genotyped genome wide 



map* 

39. The process of claim 28, further comprising the step of detecting 

S substantially all interval-based polymorphisms in the at least one genotyped genome wide 
map. 

40. The process of claim 28, wherein steps (a) and (b) are performed 
within a particular time limit, and wherein the particular time is a sub-quadratic function 
of a number of sites associated with an input data. 

10 41 . A software arrangement which, when executed on a processing 

device, configures the processing device to produce at least one haplotyped genome wide 
map comprising the steps of: 



(a) preparing chromosome maps associated with at least one 
chromosome; and 



1 5 (b) producing a portion of at least one haplotyped genome wide 

based on the chromosome maps. 



MKIt 



42. The software arrangement according to claim 4 1 , wherein the 
portion of at least one haplotyped genome wide map comprises at least one restriction 
site. 



20 43. The software arrangement according to claim 41, wherein less than 

all subparts of the haplotyped genome wide map are produced in step (b) as ordered or 
unordered sets of contigs. 



44. The software arrangement according to claim 41 , wherein the 
chromosome maps are based on at least one single molecule map data set 



25 45 . The software arrangement according to claim 41 , wherein the 

haplotyped genome wide map comprises two maps per chromosome is assembled from 
die at least one single molecule map data set 
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46, The* software, arrang 



taccording to. claim.44, whereinihaai 



least one single molecule map data set has error rates as great as or smaller than: about 
10% error in distance between sites, about 20% missing sites, about 7% false sites and 
about 50% of sites closer than about 1 kB apart that ate indistinguishable* 



single molecule ordered maps of polymorphic markers comprising at least one of 
restriction site polymorphisms, restriction length polymorphisms, insertions of bases, 
deletions of bases, single nucleotide polymorphisms (SNPs). 



least one single molecule map data sets comprising different restriction site markers are 
assembled into a single haplotyped genome wide map wherein all restriction site markers 
are combined and wherein the restriction site markers can be distinguished. 

49. The software arrangement according to claim 41, further 
1 5 comprising determining a conditional probability density expression. 

50. The software arrangement according to claim 49, wherein the 
probability density expression is based on errors provided in at least one single molecule 
map data set, 

5 1 . The software arrangement according to claim 41, wherein 

20 substantially all site based polymorphisms are detected in the at least one haplotyped 
genome wide map. 



52. The software arrangement according to claim 41, wherein 
substantially all interval-based polymorphisms are detected in the at least one haplotyped 
genome wide map. 



5 




10 



48. The software arrangement according to claim 44, wherein the at 



25 
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. 54.. _ . . 1M zoft^$jmmg^znt according. to claim 41, further - 
comprising performing a disease gene association study based on at least one haplotyped 
genome wide map per patient. 

55 . A software arrangement which, when executed on a processing 

5 device, configures the processing device to perform disease gene association based on at 
least one haplotyped genome wide map per patient, comprising the steps of: 

(a) producing at least one haplotyped genome wide map per patient ; 

and 

(b) performing the disease gene association based on the produced at 
1 0 least one haplotyped genome wide map. 

56. The software arrangement according to claim 55, wherein the step 
of producing the at least one haplotyped genome wide map, comprises the sub steps of: 

(a) preparing chromosome maps associated with at least one 
chromosome; and 

15 (b) producing a portion of the at least one haplotyped genome wide 

map based on the chromosome maps. 

57. The software arrangement according to claim 56, wherein the 
portion of at least one haplotyped genome wide map comprises at least one restriction 
site. 

20 58. The software arrangement according to claim 56, wherein less than 

all subparts of the haplotyped genome wide map are produced in sub step (b) as ordered 
or unordered sets of contigs. 

59. The software arrangement according to claim 56, wherein the 
chromosome maps are based on at least one single molecule map data set 

25 60. The software arrangement according to claim 56, wherein the 

haplotyped genome wide map comprises two maps per chromosome is assembled from at 
least one single molecule map data set 
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JS1*_ „ Xhqjsoftware arrangement. according to claim 59 r wherem the at 
least one single molecule map data set has error rates at large as or smaller than: (i) about 
10% error in distance between sites, (ii) about 20% missing sites, (iii) about 7% false 
sites and (iv) about 50% of sites closer than about 1 kB apart that are approximately 
5 indistinguishable. 

62. The software arrangement according to claim 59, wherein the at 
least one single molecule map data set consists of either Optical Mapping data or any 
single molecule ordered maps of polymorphic markers comprising at least one of 
restriction site polymorphisms, restriction length polymorphisms, insertions of bases, 

10 deletions of bases, and single nucleotide polymorphisms (SNPs). 

63. The software arrangement according to claim 59, wherein the at 
least one single molecule data set combines at least two single molecule map data sets 
comprising different restriction site markers is assembled into a single haplotyped 
genome wide map, wherein all restriction site markers are combined and wherein the 

1 5 different restriction site markers are distinguishable. 

64. The software arrangement according to claim 56 3 further 
comprising the step of determining a conditional probability density expression based on 
the chromosome maps associated with at least one chromosome. 

65. The software arrangement according to claim 64, wherein the 

20 probability density expression is based on errors provided in at least one sing le molecule 
map data set 

66. The software arrangement according to claim 56, further 
comprising the step of detecting substantially all site-based polymorphisms in the at least 
one haplotyped genome wide map. 

25 67. The software arrangement according to claim 56, further 

comprising the step of detecting substantially all interval-based polymorphisms in the at 
least one haplotyped genome wide map. 
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_68 The software arrangement according, to claim-56, wherenwteps (a) 

and (b) are performed within a particular time limit, and wherein the particular time is a 
sub-quadratic function of a number of sites associated with an input data. 

69. A software arrangement winch, when executed on a processing 

5 device, configures the processing device to produce at least one genotyped genome wide 
map comprising the steps of: 

(a) preparing chromosome maps associated with at least one 
chromosome; and 

(b) producing a portion of at least one genotyped genome wide map 
1 0 based on the chromosome maps, 

70. The software arrangement according to claim 69, wherein the 
portion of at least one genotyped genome wide map comprises at least one restriction site. 

7 1 . The software arrangement according to claim 69, wherein less than 
all subparts of the genotyped genome wide map are produced in step (b) as ordered or 



IS unordered sets of contigs. 



72. The software arrangement according to claim 69, wherein the 
chromosome maps are based on at least one single molecule map data set 

73 . The software arrangement according to claim 69, wherein the 
genotyped genome wide map comprises two maps per chromosome associated with the at 

20 least one single molecule map data set 

74. The software arrangement according to claim 72, wherein the at 
least one single molecule map data set has error rates as large as or smaller than: about 
10% error in distance between sites, about 20% missing sites, about 7% false sites and 
about 50% of sites closer than about 1 kB apart that are indistinguishable. 

25 75. The software arrangement according to claim 72, wherein the at 

least one single molecule map data set comprises Optical Mapping data or any single 
molecule ordered maps of polymorphic markers comprising any of restriction site 
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polymorphisms, restric^onlengthpo deletions of bases, 

single nucleotide polymorphisms (SNPs). 

76 . The software arrangement according to claim 72, wherein at least 
one single molecule map data set comprising different restriction site markers are 

5 assembled into a single genotyped genome wide map wherein all restriction site markers 
are combined and wherein the different restriction site markers can be distinguished. 

77. The software arrangement according to claim 69, further 
comprising determining a conditional probability density expression. 

78. The software arrangement according to claim 77, wherein the 

10 probability density expression is based on errors provided in at least one single molecule 
map data set. 

79. The software arrangement according to claim 69, wherein 
substantially all site based polymorphisms are detected in the at least one genotyped 
genome wide map. 

IS 80. The software arrangement according to claim 69, wherein 

substantially all interval-based polymorphisms are detected in the at least one genotyped 
genome wide 



tlKll 



8 1 . The software arrangement according to claim 69, wherein steps (a) 
and (b) are performed within a particular time limit, and the particular time is a sub- 

20 quadratic function of a number of sites associated with an input data. 

82. The software arrangement according to claim 69, further 
comprising performing a disease gene association study based on at least one genotyped 

* 

genome wide map per patient. 



83 . A system for producing at least one haplotyped genome wide 
25 comprising a storage medium wherein the storage medium includes software that is 
executed to perform the steps of: 
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(b) producing a portion of at least one haplotyped genome wide 
based on the chromosome maps. 



5 



84. The system of claim 83, wherein the portion of at least one 



haplotyped genome wide map comprises at least one restriction site. 

85. The system of claim 83, wherein less than all subparts of the 
haplotyped genome wide map are produced in step (b) as ordered or unordered sets of 
contigs. 



on at least one single molecule map data set. 

87. The system of claim 83, wherein the haplotyped genome wide map 
comprises two maps per chromosome is assembled from the at least one single molecule 
map data set. 



map data set has error rates as large as or smaller than: (i) about 10% error in distance 
between sites, (ii) about 20% missing sites, (iii) about 7% false sites and (iv) about 50% 
of sites closer than about 1 kB apart that are approximately indistinguishable. 



20 map data set comprises Optical Mapping data or any single molecule ordered maps of 
polymorphic markers comprising at least one of restriction site polymorphisms, 
restriction length polymorphisms, insertions of bases, deletions of bases, and single 
nucleotide polymorphisms (SNPs). 

90. The system of claim 86, wherein the at least one single molecule 
25 data set combines at least two single molecule map data sets comprising of different 

restriction site markers is assembled into a single haplotyped genome wide map, wherein 



10 




15 
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all restriQtipQ.i3ite markers jre combined and wherein the .different restriction site markers 
are distinguishable. 

91 . The system of claim 83, further comprising the step of determining 
a conditional probability density expression based on the chromosome maps associated 

S with at least one chromosome. 

92. The system of claim 91 , wherein the probability density expression 
is based on errors provided in at least one single molecule map data set 

93 . The system of claim 83, further comprising the step of detecting 
substantially all site-based polymorphisms in the at least one haplotyped genome wide 

10 map. 

94. The system of claim 83, further comprising the step of detecting 
substantially all interval-based polymorphisms in the at least one haplotyped genome 

■ wide 



95. The system of claim 83, wherein steps (a) and (b) are performed 

1 5 within a particular time limit, and wherein the particular time is a sub-quadratic function 
of a number of sites associated with an input data. 

96. A system for performing disease gene association based on at least 
one haplotyped genome wide map per patient, comprising a storage medium wherein the 
storage medium includes software that is executed to perform the steps of: 

20 (a) producing at least one haplotyped genome wide map per 

patient; and 

(b) performing the disease gene association based on the 
produced at least one haplotyped genome wide map. 



97. The system of claim 96, wherein the step of producing the at least 
25 one haplotyped genome wide map, comprises the sub steps of: 
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iiiimti 



„(a) p^P.wng chromosome maps associated with atleast one 
some; and 



(b) producing a portion of the at least one haplotyped genome wide 
map based on the chromosome maps. 



5 98. The system of claim 97, wherein the portion of at least one 

haplotyped genome wide map comprises at least one restriction site. 



99. The system of claim 97, wherein less than all subparts of the 
haplotyped genome wide map are produced in sub step (b) as ordered or unordered sets 
of contigs. 



1 0 100. The system of claim 97, wherein the chromosome maps are based 

on at least one single molecule map data set. 

101, The system of claim 97, wherein the haplotyped genome wide map 
comprises two maps per chromosome is assembled Scorn at least one single molecule map 
data set. 

15 1 02. The system of claim 1 00, wherein the at least one single molecule 

map data set has error rates as large as or smaller than: (i) about 10% error in distance 
between sites, (ii) about 20% missing sites, (iii) about 7% false sites and (iv) about 50% 
of sites closer than about 1 kB apart that are approximately indistinguishable. 

1 03 . The system of claim 1 00, wherein the at least one single molecule 
20 map data set comprises Optical Mapping data or any single molecule ordered maps of 

polymorphic markers comprising at least one of restriction site polymorphisms, 
restriction length polymorphisms, insertions of bases, deletions of bases, and single 
nucleotide polymorphisms (SNPs). 

1 04. The system of claim 1 00, wherein the at least one single molecule 
25 data set combines at least two single molecule map data sets comprising different 

restriction site markers is assembled into a single haplotyped genome wide map, wherein 
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all restrictipn_site maoie^^e.cpiBbmed and:whereiathe different restriction site markers 
arc distinguishable. 

1 05 . The system of claim 97, further comprising the step of detennining 
a conditional probability density expression based on the chromosome maps associated 

5 with at least one chromosome. 

106. The system of claim 105, wherein the probability density 
expression is based on errors provided in at least one single molecule map data set, 

1 07. The system of claim 97, further comprising the step of detecting 
substantially all site-based polymorphisms in the at least one haplotyped genome wide 

10 map. 

108. The system of claim 97, further comprising the step of detecting 
substantially all interval-based polymorphisms in the at least one haplotyped genome 
wide map. 

109. The system of claim 97, wherein steps (a) and (b) are performed 

1 5 within a particular time limit, and wherein the particular time is a sub-quadratic function 
of a number of sites associated with an input data. 

110. A system for producing at least one genotyped genome wide map 
comprising a storage medium wherein the storage medium includes software that is 
executed to perform the steps of: 

20 (a) preparing chromosome maps associated with at least one 

chromosome; and 

(b) producing a portion of at least one genotyped genome wide map 

■ 

based on the chromosome maps. 

111. The system of claim 110, wherein the portion of at least one 
25 genotyped genome wide map comprises at least one restriction she. 
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_ 112.. . The_system of claim llD,.whereiuless than all subparts of the - 
genotyped genome wide map are produced in step (b) as ordered or unordered sets of 
contigs. 

1 13. The system of claim 1 10, wherein the chromosome maps are based 
5 on at least one single molecule map data set. 
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1 14. The system of claim 110, wherein the genotyped genome wide 
map comprises two maps per chromosome assembled from at least one single molecule 
map data set. 

115. The system of claim 113, wherein the at least one single molecule 
map data set has error rates as large as or smaller than: (i) about 10% error in distance 
between sites, (ii) about 20% missing sites, (iii) about 7% false sites and (iv) about 50% 
of sites closer than about 1 kB apart that are approximately indistinguishable. 



116. The system of claim 113, wherein the at least one single molecule 
map data set comprises Optical Mapping data or any single molecule ordered maps of 
polymorphic markers comprising at least one of restriction sites, restriction length 
polymorphisms, insertions of bases, deletions of bases, and single nucleotide 
polymorphisms (SNPs). 

117. The system of claim 1 13, wherein the at least one single molecule 
data set combines at least two single molecule map data sets comprising different 
restriction site markers is assembled into a single genotyped genome wide map, wherein 
all restriction site markers are combined and wherein the different restriction site markers 
are distinguishable. 

118. The system of claim 110, further comprising the step of 
detexmining a conditional probability density expression based on the chromosome maps 
associated with at least one chromosome. 

119. The system of claim 118, wherein the probability density 
expression is based on errors provided in at least one single molecule map data set 
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J20. . The^ystem.of claim 1 10, further comprising the step of detecting 
substantially all site-based polymorphisms in the at least one genotyped genome wide 
map. 

121 . The system of claim 110, further comprising the step of detecting 

5 substantially all interval-based polymorphisms in the at least one genotyped genome wide 
map. 

122. The system of claim 110, wherein steps (a) and (b) are performed 
within a particular time limit, and wherein the particular time is a sub-quadratic function 
of a number of sites associated with an input data, 
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